We consider settings where data are available on a nonparametric
function and various partial derivatives. Such circumstances arise in 
practice, for example in the joint estimation of cost and input func- 
tions in economics. We show that when derivative data are available, 
local averages can be replaced in certain dimensions by nonlocal aver- 
ages, thus reducing the nonparametric dimension of the problem. We 
derive optimal rates of convergence and conditions under which di- 
mension reduction is achieved. Kernel estimators and their properties 
are analyzed, although other estimators, such as local polynomial, 
spline and nonparametric least squares, may also be used. Simula- 
tions and an application to the estimation of electricity distribution 
costs are included. 

1. Introduction. We consider settings where data are available on a
nonparametric function and various partial derivatives. For example, suppose
data $(X_{1i}, X_{2i}, Y_i, Y_{1i})$, $i = 1, \ldots, n$, are available for

$$y = g(x_1, x_2) + \varepsilon, \qquad y_1 = \frac{\partial g(x_1, x_2)}{\partial x_1} + \varepsilon_1.$$

Then g can be estimated at rates as though it were a function of a single 
nonparametric variable, rather than two. Heuristically, the presence of data 
on the partial derivative with respect to $x_1$ eliminates the need for local
averaging in the $x_1$ direction. This, in turn, results in dimension reduction
and suggests the possibility of estimating g and its derivatives at relatively 
fast rates. 
It is natural to ask whether data on derivatives would be available in
practical settings, or whether this investigation is esoteric. In fact, such 
data are commonly available in economics. The underlying reason is that 
economic models frequently assume that agents economize, that is, that 
they implicitly or explicitly solve constrained optimization problems. Thus, 
data may not only be available on an objective function, but also on first 
order conditions related to the optimization problem. An example serves to 
illustrate the point. 

Consider y = g(Q,r,w) + e, where y is the minimum cost of producing 
output level Q given r and w, the prices of capital and labor, respectively, 
and e is a residual. By the envelope theorem or, equivalently, Shephard's 
Lemma (see, e.g., [17, 23]), the partial derivatives of g with respect to r and w 
yield the optimal quantities of capital and labor required to produce Q. Joint 
estimation of cost functions and their partial derivatives (i.e., the inputs) 
using parametric models is routinely undertaken (see, e.g., [10, 14]). Florens, 
Ivaldi and Larribeau [7] analyze the behavior of parametric approximations 
of systems such as the ones considered in this paper. However, nonparametric 
estimation as proposed here has received little attention. 

Quite different examples of the same type arise in engineering settings, 
for example, in real-time records of certain types of motion sensors and in 
modeling problems connected to stochastic control; see, for instance, [16, 19]. 

Rates of convergence for nonparametric regression (e.g., [20, 21]) often 
limit the usefulness of conventional nonparametric models in fields where 
regression modeling involves multiple explanatory variables. Several devices 
are available to mitigate this curse of dimensionality. They include additive 
and varying-coefficient models (see, e.g., [3, 11, 12, 22]), projection-based 
methods (e.g., [4, 9, 13]), and recursive partitioning and tree-based meth- 
ods (e.g., [2, 8, 26]). For the most part, these approaches fit "abbreviated" 
models, where components or interactions among components are dropped 
in order to reduce the variability of an estimator. We shall show that incor- 
porating derivative information can yield lower variability and faster con- 
vergence rates for the full underlying regression function, without any need 
for abbreviation. 

Methodology based on this idea can be expected to reach beyond exam- 
ples in economics and engineering such as those given earlier. Particularly 
with the development of new technologies which allow rates of change to be 
recorded at discrete times, systems in the physical and biological sciences 
offer opportunities for dimension reduction using derivative data. For exam- 
ple, in meteorology, each of barometric pressure, wind speed and direction 
(the latter two being functions of the pressure gradient) are measured over 
broad geographic regions. In some fields, an evolving system is often mod- 
eled as a (possibly constrained) optimization problem, so one might expect 
data relating to first order conditions to be available there. 
This paper is organized as follows. Section 2 outlines our assumptions and
provides results on optimal rates of convergence. The approaches to dimension
reduction addressed there are nonstandard. Section 3, which shows that
suitably constructed kernel estimators achieve optimal rates of convergence,
uses familiar smoothing methods surveyed by, for example, Wand and Jones
[25], Fan and Gijbels [5] and Simonoff [18]. We also note that the idea of
combining local and nonlocal averaging has been used by Linton and Nielsen
[15] and Fan, Härdle and Mammen [6]. Results of Bickel and Ritov [1] on
estimators that are constructed by "plugging in" root-n consistent estimators
of functions are more distantly related. Section 4 describes results of
simulations and an empirical application involving data on electricity
distribution costs. Proofs of propositions are deferred to the Appendix.

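Before proceeding, it may be useful to illustrate our results on rates of
convergence. Let $g(x_1, x_2, x_3, x_4)$ be a nonparametric function for which we
have data and consider the following hierarchy of functions, where
superscripts denote multiple first order partial derivatives:

$$
\begin{array}{c}
g^{(1,1,1,1)} \\[4pt]
g^{(1,1,1,0)} \;\big|\; g^{(1,1,0,1)} \quad g^{(1,0,1,1)} \quad g^{(0,1,1,1)} \\[4pt]
g^{(1,1,0,0)} \quad g^{(1,0,1,0)} \quad g^{(0,1,1,0)} \;\big|\; g^{(1,0,0,1)} \quad g^{(0,1,0,1)} \quad g^{(0,0,1,1)} \\[4pt]
g^{(1,0,0,0)} \quad g^{(0,1,0,0)} \quad g^{(0,0,1,0)} \;\big|\; g^{(0,0,0,1)}
\end{array}
$$

The vertical bars mark the main diagonal of the display; the entries to its
left are the partials that do not involve $x_4$.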

(a) If data are available on the complete hierarchy, then $g$ can be estimated
root-n consistently; that is, the "nonparametric dimension" of the estimation
problem is zero. (b) If data are available on all multiple first order
partials for any subset of $p$ variables, then the nonparametric dimension
is $4 - p$. For example, if one observes all partials below the main diagonal,
then the nonparametric dimension is one. (c) If data are available on all
multiple first order partials for any subset of $p$ variables, except those
appearing in the bottom $\ell$ rows, then the nonparametric dimension is
$4 - (p - \ell)$. For example, if one observes all partials in the northwest
wedge, then the nonparametric dimension is two. (d) For an arbitrary set of
observed partial derivatives, an upper bound on the nonparametric dimension
of the estimation problem may be determined by using (b) and (c) to find the
subset which yields the lowest nonparametric dimension. For example, if one
observes all simple first order partials, that is, all partials in the bottom row,
then the nonparametric dimension does not exceed three. If, in addition, one
observes $g^{(1,1,0,0)}$, then the nonparametric dimension does not exceed two.
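Rules (b) to (d) amount to a small search over subsets of variables, which the following sketch spells out. It is only an illustration under stated assumptions: the function name `dimension_upper_bound` is hypothetical, each partial derivative is encoded as a 0/1 tuple, data on $g$ itself are taken to be available, and rule (c) is read as "all partials on the chosen subset of order greater than $\ell$ are observed".

```python
from itertools import combinations


def dimension_upper_bound(observed, k):
    """Upper bound on the nonparametric dimension implied by rules (b) to (d).

    observed: iterable of 0/1 tuples of length k, one per observed partial
    derivative (the function g itself is assumed to be observed as well).
    """
    observed = {tuple(beta) for beta in observed}
    best = k  # with data on g alone, no reduction is available
    for p in range(1, k + 1):
        for subset in combinations(range(k), p):
            for ell in range(p):
                # Rule (c): every partial on `subset` of order > ell is needed
                # (ell = 0 is rule (b)).
                needed = [
                    tuple(1 if j in idx else 0 for j in range(k))
                    for order in range(ell + 1, p + 1)
                    for idx in combinations(subset, order)
                ]
                if all(beta in observed for beta in needed):
                    best = min(best, k - (p - ell))
                    break  # a larger ell only weakens the bound for this subset
    return best


simple = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
print(dimension_upper_bound(simple, 4))                   # bottom row only: 3
print(dimension_upper_bound(simple + [(1, 1, 0, 0)], 4))  # plus g^(1,1,0,0): 2
```

The two printed cases reproduce the examples given under (d); the search is exhaustive, so it is practical only for small $k$.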
2. Properties underpinning methodology.

2.1. Main theorem about functionals. For simplicity, we shall assume
that $g$ is supported on the unit cube $\mathcal{R}_k = [0,1]^k$, although substantially
more general designs are possible. Let $A$ denote the set of all sequences
$\alpha = (\alpha_1, \ldots, \alpha_k)$ of length $k$ consisting solely of zeros and ones.
Given $\alpha \in A$ and $x = (x_1, \ldots, x_k) \in \mathcal{R}_k$, define $|\alpha| = \sum_j \alpha_j$ and

$$g^\alpha(x) = \frac{\partial^{|\alpha|} g(x)}{\partial x_{i_1} \cdots \partial x_{i_{|\alpha|}}},$$

where $i_1 < \cdots < i_{|\alpha|}$ denotes the sequence of indices $i$ for which $\alpha_i = 1$.

Let $B_k$ denote the class of bounded functions on $\mathcal{R}_k$ and let $\mathcal{G}_k$ denote the
class of functions $g$ on $\mathcal{R}_k$ for which $g^\alpha \in B_k$ for each $\alpha \in A$. Given $C > 0$,
let $\mathcal{K}(C)$ denote the class of functionals $\psi$ that may be represented as

$$(\psi g)(x) = \int_{\mathcal{R}_k} \chi(u, x)\, g(u)\, du \quad \text{for all } g \in B_k, \tag{2.1}$$

where the function $\chi$ (which determines $\psi$) satisfies
$\sup_{u, x \in \mathcal{R}_k} |\chi(u, x)| \le C$.

Theorem 1. There exists a set of functionals $\{\psi_\alpha, \alpha \in A\} \subseteq \mathcal{K}(1)$ such
that for all $g \in \mathcal{G}_k$,

$$g = \sum_{\alpha \in A} \psi_\alpha g^\alpha.$$

A proof of this theorem and explicit formulae for the functionals $\psi_\alpha$ are
given in Appendix A.1.

To appreciate the implications of Theorem 1 for inference, assume that
for each $\alpha \in A$, that is, for each model $y_\alpha = g^\alpha(x) + \varepsilon_\alpha$, we have data pairs
$(X_{\alpha i}, Y_{\alpha i})$ generated by

$$Y_{\alpha i} = g^\alpha(X_{\alpha i}) + \varepsilon_{\alpha i}, \qquad 1 \le i \le n_\alpha, \tag{2.2}$$

where the $X_{\alpha i}$'s are distributed on $\mathcal{R}_k$ with a density $f_\alpha$ that is bounded
away from zero there and the errors $\varepsilon_{\alpha i}$ are independent with zero means
and bounded variances, and are also independent of the $X_{\alpha i}$'s. Put
$n = \min_{\alpha \in A} n_\alpha$. It follows from the form of the functional $\psi_\alpha$ [see (2.1)]
that from these data, we may construct an estimator $\widehat{\psi_\alpha g^\alpha}$ of $\psi_\alpha g^\alpha$
that is root-n consistent whenever $g \in \mathcal{G}_k$.

For example, if the $X_{\alpha i}$'s are uniformly distributed on $\mathcal{R}_k$ and if $\psi = \psi_\alpha$
is given by (2.1) with $\chi$ there denoted by $\chi_\alpha$, then an unbiased root-n
consistent estimator of $\psi_\alpha g^\alpha$ is given by $\widehat{\psi_\alpha g^\alpha}$, where

$$(\widehat{\psi_\alpha g^\alpha})(x) = \frac{1}{n_\alpha} \sum_{i=1}^{n_\alpha} Y_{\alpha i}\, \chi_\alpha(X_{\alpha i}, x). \tag{2.3}$$

Theorem 1 now implies that

$$\hat g = \sum_{\alpha \in A} \widehat{\psi_\alpha g^\alpha} \tag{2.4}$$

is a root-n consistent estimator of $g$. Properties of estimators such as
$\widehat{\psi_\alpha g^\alpha}$ and $\hat g$ will be discussed in Section 3.2.

The theory that we develop admittedly does not address the "cost" of 
sampling data on derivatives. In the examples from economics and engineer- 
ing discussed in Section 1, the cost is low, although in some other problems 
it is prohibitively high. Moreover, if high order derivative information is ab- 
sent, then our estimators simply do not enjoy fast convergence rates. We 
characterize convergence rates in terms of the value of $n = \min_{\beta \in B} n_\beta$ and
do not dwell on the fact that if there is a sufficiently large order of magnitude 
of data on (X,Y) alone, sufficiently greater than n, then the convergence 
rate of a conventional nonparametric estimator based solely on those data 
can be faster than the rates given in this paper. 

The assumption that errors for measurements of different derivatives are 
independent can be significantly relaxed without affecting the theoretical 
results that we shall give in Section 3. The assumption may not be com- 
pletely plausible in the setting of capital and labor costs, but it is realistic 
in the context of engineering problems, where motion sensor data on func- 
tions and their derivatives are estimated by different sensors with different 
characteristics. Correlations among the errors for different functions will be 
permitted in the simulation study in Section 4. 

In the following examples, the decomposition of g provided in Theorem 1 
is rearranged to illustrate root-n consistent estimation. 

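Example 1. Suppose $k = 1$ and that noisy data are available for $g(x)$
and $dg(x)/dx$. Write $g(x) = \underline{g}_1(x) + \underline{g}_0(\cdot)$, where

$$\underline{g}_1(x) = \int_0^x \frac{dg(u)}{du}\, du = g(x) - g(0),$$

$$\underline{g}_0(\cdot) = \int_0^1 \{g(u) - \underline{g}_1(u)\}\, du = g(0).$$

The function $\underline{g}_1$ can be estimated root-n consistently, in which case its
integral and hence $\underline{g}_0$ can too.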
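The construction in Example 1 can be sketched in a few lines. This is a minimal simulation under assumptions that are not in the text: a uniform design on $[0,1]$, the hypothetical test function $g(t) = \sin(2t)$, Gaussian errors, and the names `estimate_g` and `simulate`. No bandwidth appears, because both components are plain sample averages.

```python
import math
import random


def estimate_g(x, deriv_data, level_data):
    """Estimate g(x) for k = 1 from noisy data on g' and on g (uniform design)."""
    n1 = len(deriv_data)
    # g_1(x) = int_0^x g'(u) du: average of Y * 1{X <= x}, no smoothing needed.
    g1_hat = sum(y for xi, y in deriv_data if xi <= x) / n1
    # int_0^1 g_1(u) du = int_0^1 g'(v) (1 - v) dv: average of Y * (1 - X).
    int_g1_hat = sum(y * (1.0 - xi) for xi, y in deriv_data) / n1
    # g_0 = int_0^1 {g(u) - g_1(u)} du.
    g0_hat = sum(y for _, y in level_data) / len(level_data) - int_g1_hat
    return g1_hat + g0_hat


def simulate(n, x, seed=0, noise=0.1):
    """Draw both samples for the hypothetical g(t) = sin(2t) and estimate g(x)."""
    rng = random.Random(seed)
    deriv_data, level_data = [], []
    for _ in range(n):
        t = rng.random()
        deriv_data.append((t, 2.0 * math.cos(2.0 * t) + rng.gauss(0.0, noise)))
        t = rng.random()
        level_data.append((t, math.sin(2.0 * t) + rng.gauss(0.0, noise)))
    return estimate_g(x, deriv_data, level_data)


print(simulate(20000, 0.5), math.sin(1.0))  # estimate next to the true value
```

The weight $1 - X$ arises from exchanging the order of integration in $\int_0^1 \int_0^u g'(v)\,dv\,du$.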

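Example 2. Suppose $k = 2$ and that noisy data are available for $g^{(1,1)}$,
$g^{(1,0)}$, $g^{(0,1)}$ and $g^{(0,0)} = g$. Write
$g(x) = \underline{g}_{11}(x_1, x_2) + \underline{g}_{10}(x_1, \cdot) + \underline{g}_{01}(\cdot, x_2) + \underline{g}_{00}(\cdot, \cdot)$, where

$$\underline{g}_{11}(x_1, x_2) = \int_0^{x_1}\!\!\int_0^{x_2} g^{(1,1)}(u_1, u_2)\, du_1\, du_2
= g(x_1, x_2) - g(x_1, 0) - g(0, x_2) + g(0, 0),$$

P. HALL AND A. YATCHEW

$$\underline{g}_{10}(x_1, \cdot) = \int_0^{x_1}\!\!\int_0^1 g^{(1,0)}(u_1, x_2)\, du_1\, dx_2 - \int_0^1 \underline{g}_{11}(x_1, x_2)\, dx_2
= g(x_1, 0) - g(0, 0),$$

$$\underline{g}_{01}(\cdot, x_2) = \int_0^1\!\!\int_0^{x_2} g^{(0,1)}(x_1, u_2)\, dx_1\, du_2 - \int_0^1 \underline{g}_{11}(x_1, x_2)\, dx_1
= g(0, x_2) - g(0, 0),$$

$$\underline{g}_{00}(\cdot, \cdot) = \int_0^1\!\!\int_0^1 \{g(x_1, x_2) - \underline{g}_{11}(x_1, x_2) - \underline{g}_{10}(x_1, \cdot) - \underline{g}_{01}(\cdot, x_2)\}\, dx_1\, dx_2
= g(0, 0).$$

Sample analogues of all integral expressions can be calculated without local
averaging. Thus, $\underline{g}_{11}$ and its integrals can be estimated root-n consistently,
in which case $\underline{g}_{10}$ and $\underline{g}_{01}$, their respective integrals and $\underline{g}_{00}$ can too.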

2.2. Application of Theorem 1 to lower-dimensional structures. Let
$1 \le p \le k$ and consider a lower-dimensional "subspace" of $A$, specifically the set
$B$ of all sequences $\beta = (\beta_1, \ldots, \beta_p)$ of length $p$ consisting solely of zeros and
ones. Given $g \in B_k$, define $|\beta|$ and $g^\beta$ analogously to $|\alpha|$ and $g^\alpha$. In particular,
$g^\beta$ is a function on $\mathcal{R}_k$, not on the lower-dimensional space $\mathcal{R}_p = [0,1]^p$, and

$$g^\beta(x) = \frac{\partial^{|\beta|} g(x)}{\partial x_{i_1} \cdots \partial x_{i_{|\beta|}}}, \tag{2.5}$$

where $i_1 < \cdots < i_{|\beta|}$ denotes the sequence of indices $i$ for which $\beta_i = 1$.
Similarly, although the functional $\psi_\beta$ (the $p$-dimensional analogue of $\psi_\alpha$
introduced in Theorem 1) would normally be interpreted as the functional which
takes $b \in B_p$ to $\psi_\beta b$, defined by

$$(\psi_\beta b)(x_1, \ldots, x_p) = \int_{\mathcal{R}_p} \chi(u_1, \ldots, u_p, x_1, \ldots, x_p)\, b(u_1, \ldots, u_p)\, du_1 \cdots du_p,$$

it can just as easily be interpreted as the functional that takes $g \in B_k$ to
$\psi_\beta g$, defined by

$$(\psi_\beta g)(x_1, \ldots, x_k) = \int_{\mathcal{R}_p} \chi(u_1, \ldots, u_p, x_1, \ldots, x_p)\, g(u_1, \ldots, u_p, x_{p+1}, \ldots, x_k)\, du_1 \cdots du_p. \tag{2.6}$$

We shall adopt the latter interpretation.

We may, of course, interpret $\beta$ as a $k$-vector and an element of $A$, with its
last $k - p$ components equal to zero. We shall take this view in Section 2.3,
where we shall treat cases that cannot be readily subsumed under a model
in which noisy observations are made of $\psi_\beta g^\beta$ for each $\beta \in B$.

Let $\mathcal{G}_{kp}$ denote the class of functions $g \in B_k$ for which $g^\beta$ is well defined
and bounded on $\mathcal{R}_k$ for each $\beta \in B$. The following result is an immediate
corollary of Theorem 1. It is derived by applying Theorem 1 to the function
that is defined on $\mathcal{R}_p$ and is obtained from $g$ by fixing the last $k - p$
coordinates of $x$ and allowing the first $p$ coordinates to vary in $\mathcal{R}_p$. However,
although Corollary 1 can be proved from Theorem 1, the theorem is a special
case of the corollary.

Corollary 1. Assume $1 \le p \le k$ and let $\psi_\beta$, for $\beta \in B$, denote the
functionals introduced in Section 2.1, but interpreted in the sense of (2.6).
Then for each $g \in \mathcal{G}_{kp}$,

$$g = \sum_{\beta \in B} \psi_\beta g^\beta.$$

The main statistical implication of the corollary is that by observing data
on $g^\beta$ for each $\beta \in B$, we reduce the effective dimension of the problem of
estimating $g$ from $k$ to $k - p$. The manner in which $g$ depends on its first $p$
components can be estimated root-n consistently and then performance in
the estimation problem is driven by the difficulty of determining the way in
which $g$ is influenced by its last $k - p$ components.

To better appreciate this point, assume that for each $\beta \in B$, data
$(X_{\beta i}, Y_{\beta i})$ are generated by an analogue of the model at (2.2),

$$Y_{\beta i} = g^\beta(X_{\beta i}) + \varepsilon_{\beta i}, \qquad 1 \le i \le n_\beta, \tag{2.7}$$

where $g \in \mathcal{G}_{kp}$ and the $X_{\beta i}$'s are distributed on $\mathcal{R}_k$. Suppose, for simplicity,
that the common density of the $X_{\beta i}$'s is uniform on $\mathcal{R}_k$. Let $X_{\beta i}^{[k-p]}$ and
$x^{[k-p]}$ represent the $(k-p)$-vectors comprised of the last $k - p$ components of
$X_{\beta i}$ and $x$, respectively. Denote by $K$ a $(k-p)$-dimensional kernel function, let
$h$ be a bandwidth and, in close analogy with (2.3), define $\widehat{\psi_\beta g^\beta}$ by

$$(\widehat{\psi_\beta g^\beta})(x) = \frac{1}{n_\beta h^{k-p}} \sum_{i=1}^{n_\beta} Y_{\beta i}\, \chi_\beta(X_{\beta i}, x)\, K\!\left(\frac{X_{\beta i}^{[k-p]} - x^{[k-p]}}{h}\right). \tag{2.8}$$

Set $n = \min_{\beta \in B} n_\beta$. It is readily proved that if (i) $g$ has $d$ derivatives of its
last $k - p$ components as well as all multiple first derivatives of its first $p$
components, (ii) $K$ is a bounded, compactly supported, $d$th order kernel, (iii) $x$ is
an interior point of $\mathcal{R}_k$, so as to avoid edge effects, and (iv)
$h = h(n) \sim \text{const} \cdot n^{-1/(2d+k-p)}$, then $(\widehat{\psi_\beta g^\beta})(x)$ converges to
$(\psi_\beta g^\beta)(x)$ at the standard squared-error rate, $n^{-2d/(2d+k-p)}$, for estimating
functions of $k - p$ variables with $d$ derivatives. This result is a consequence of
the fact that the smoothing at (2.8) is only over the last $k - p$ coordinates of
the data $X_{\beta i}$. Therefore, the estimator

$$\hat g = \sum_{\beta \in B} \widehat{\psi_\beta g^\beta}, \tag{2.9}$$

analogous to that at (2.4), converges to $g$ at the squared-error rate
$n^{-2d/(2d+k-p)}$. Properties of $\widehat{\psi_\beta g^\beta}$ and $\hat g$ will be discussed in Section 3.2.



Example 3. Returning to the example in the introduction, suppose $k = 2$
and that noisy data are available for $g$ and $g^{(1,0)}$. Write
$g(x) = \underline{g}_{11}(x_1, x_2) + \underline{g}_{01}(\cdot, x_2)$, where

$$\underline{g}_{11}(x_1, x_2) = \int_0^{x_1} g^{(1,0)}(u_1, x_2)\, du_1 = g(x_1, x_2) - g(0, x_2),$$

$$\underline{g}_{01}(\cdot, x_2) = \int_0^1 \{g(x_1, x_2) - \underline{g}_{11}(x_1, x_2)\}\, dx_1 = g(0, x_2).$$

Estimates of $\underline{g}_{11}$ and $\underline{g}_{01}$ require local averaging in the $x_2$ direction only.
Thus, $\underline{g}_{11}$ can be estimated at one-dimensional optimal rates, in which case
its integral and $\underline{g}_{01}$ can too.
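Example 3 combines nonlocal averaging in $x_1$ with kernel smoothing in $x_2$, along the lines of (2.8). The sketch below is an illustration only, under assumptions not taken from the text: a uniform design on the unit square, the hypothetical test function $g(x_1, x_2) = \sin(x_1 + x_2)$, an Epanechnikov kernel (second order, so $d = 2$), an interior evaluation point, and the names `estimate_g` and `simulate`.

```python
import math
import random


def epanechnikov(t):
    """Second-order kernel supported on [-1, 1] (an illustrative choice)."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0


def estimate_g(x1, x2, deriv_data, level_data, h):
    """Estimate g(x1, x2) from data on g^(1,0) and g, smoothing in x2 only."""
    # g_11(x1, x2) = int_0^{x1} g^(1,0)(u1, x2) du1: nonlocal in x1, local in x2.
    g11 = sum(y * (a <= x1) * epanechnikov((b - x2) / h) for a, b, y in deriv_data)
    # int_0^1 g_11(x1, x2) dx1 uses the weight (1 - X1) in place of 1{X1 <= x1}.
    int_g11 = sum(y * (1.0 - a) * epanechnikov((b - x2) / h) for a, b, y in deriv_data)
    # int_0^1 g(x1, x2) dx1: a one-dimensional kernel average of the level data.
    int_g = sum(y * epanechnikov((b - x2) / h) for _, b, y in level_data)
    n1, n0 = len(deriv_data), len(level_data)
    g01 = int_g / (n0 * h) - int_g11 / (n1 * h)
    return g11 / (n1 * h) + g01


def simulate(n, x1, x2, seed=0, noise=0.1):
    """Simulate both samples for g = sin(x1 + x2) and estimate g(x1, x2)."""
    rng = random.Random(seed)
    h = n ** (-1.0 / 5.0)  # h ~ n^{-1/(2d + k - p)} with d = 2 and k - p = 1
    deriv_data, level_data = [], []
    for _ in range(n):
        a, b = rng.random(), rng.random()
        deriv_data.append((a, b, math.cos(a + b) + rng.gauss(0.0, noise)))
        a, b = rng.random(), rng.random()
        level_data.append((a, b, math.sin(a + b) + rng.gauss(0.0, noise)))
    return estimate_g(x1, x2, deriv_data, level_data, h)


print(simulate(20000, 0.3, 0.5), math.sin(0.8))  # estimate next to the true value
```

No edge correction is applied, so the sketch should only be evaluated at points whose $x_2$ coordinate lies at least $h$ from the boundary.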

Example 4. Suppose $k = 3$ and that noisy data are available for $g^{(1,1,0)}$,
$g^{(1,0,0)}$, $g^{(0,1,0)}$ and $g^{(0,0,0)} = g$. Write
$g(x) = \underline{g}_{111}(x_1, x_2, x_3) + \underline{g}_{101}(x_1, \cdot, x_3) + \underline{g}_{011}(\cdot, x_2, x_3) + \underline{g}_{001}(\cdot, \cdot, x_3)$, where

$$\underline{g}_{111}(x_1, x_2, x_3) = \int_0^{x_1}\!\!\int_0^{x_2} g^{(1,1,0)}(u_1, u_2, x_3)\, du_1\, du_2
= g(x_1, x_2, x_3) - g(x_1, 0, x_3) - g(0, x_2, x_3) + g(0, 0, x_3),$$

$$\underline{g}_{101}(x_1, \cdot, x_3) = \int_0^{x_1}\!\!\int_0^1 g^{(1,0,0)}(u_1, x_2, x_3)\, du_1\, dx_2 - \int_0^1 \underline{g}_{111}(x_1, x_2, x_3)\, dx_2
= g(x_1, 0, x_3) - g(0, 0, x_3),$$

$$\underline{g}_{011}(\cdot, x_2, x_3) = \int_0^1\!\!\int_0^{x_2} g^{(0,1,0)}(x_1, u_2, x_3)\, dx_1\, du_2 - \int_0^1 \underline{g}_{111}(x_1, x_2, x_3)\, dx_1
= g(0, x_2, x_3) - g(0, 0, x_3),$$

$$\underline{g}_{001}(\cdot, \cdot, x_3) = \int_0^1\!\!\int_0^1 \{g(x_1, x_2, x_3) - \underline{g}_{111}(x_1, x_2, x_3) - \underline{g}_{101}(x_1, \cdot, x_3) - \underline{g}_{011}(\cdot, x_2, x_3)\}\, dx_1\, dx_2
= g(0, 0, x_3).$$

Estimates of each of the above component functions require local averaging
in the $x_3$ direction only. Thus, $\underline{g}_{111}$ and its integrals can be estimated at
one-dimensional optimal rates, as can $\underline{g}_{101}$ and $\underline{g}_{011}$, their respective
integrals and hence also $\underline{g}_{001}$.

With a mild abuse of notation, suppose that $x_3$ in Example 4 is of length
$k - 2$. Then $g$ can be estimated at $(k-2)$-dimensional optimal rates.

2.3. More general settings. In Corollary 1, we assumed that we have
available all multiple first derivatives $g^\beta$ of the first $p$ components of $g$. Our
restriction to the first $p$ components was made only for notational
convenience; they could have been any $p$ components. In particular, we may alter
the definition at (2.5) to

$$g^\beta(x) = \frac{\partial^{|\beta|} g(x)}{\partial x_{I(i_1)} \cdots \partial x_{I(i_{|\beta|})}}, \tag{2.10}$$

where $I(1) < \cdots < I(p)$ denotes any given subsequence of length $p$ of $1, \ldots, k$,
without affecting the validity of the corollary. The functional $\psi_\beta$ would be
interpreted analogously. Taking this view (which we shall in the present
section), we may interpret $\beta$ as a $k$-vector.

Low-dimensional cases, such as that treated by Corollary 1, are motivated
by circumstances where multiple first derivatives are observed for a subset
of variables. It may be that one is able to observe data on $g^\alpha$ for all $\alpha \in A$
such that $|\alpha| \ge \ell$, say, but not for any other values of $\alpha$. This case is not
immediately covered by Theorem 1 or Corollary 1, which can be viewed as
treating the contrary setting $|\alpha| \le \ell$.

We shall adopt the general setting discussed in the paragraph containing
(2.10) so as to stress the wide applicability of our results. Assume
$1 \le p \le k$, $0 \le \ell \le k$ and $1 \le p - \ell + 1 \le k$ and suppose that we have derivative
information from components in $V = \{I(1), \ldots, I(p)\}$. Let $\beta$ and $g^\beta$ be as
in (2.10) and assume that we have noisy data on $g^\beta$ for all $\beta \in B$ such
that $|\beta| \ge \ell$, as well as for $\beta = 0$; see (2.7). Then we may construct an
estimator of $g$, closely analogous to that at (2.9) and enjoying the
squared-error convergence rate $n^{-2d/(2d+k-q)}$, where $q = p - \ell + 1$. That rate is valid
under the assumption that $g$ has $d$ bounded derivatives.
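To see how the rate responds to $d$, $k$, $p$ and $\ell$, the exponent can be tabulated directly. The helper names `rate_exponent` and `reduced_dimensions` are hypothetical; the only formulae used are the squared-error rate $n^{-2d/(2d+k-q)}$ and $q = p - \ell + 1$ stated above, with $q = 0$ taken to represent the conventional case of no derivative data.

```python
from fractions import Fraction


def rate_exponent(d, k, q):
    """Return r such that the squared-error rate is n**(-r), r = 2d / (2d + k - q)."""
    if d < 1 or not 0 <= q <= k:
        raise ValueError("need d >= 1 and 0 <= q <= k")
    return Fraction(2 * d, 2 * d + k - q)


def reduced_dimensions(p, ell):
    """q = p - ell + 1, the number of dimensions removed in this setting."""
    return p - ell + 1


k, d = 4, 2
print("no derivative data:", rate_exponent(d, k, 0))
for p, ell in [(2, 2), (3, 2), (4, 1)]:
    q = reduced_dimensions(p, ell)
    print("p =", p, "ell =", ell, "q =", q, "exponent =", rate_exponent(d, k, q))
```

With $q = k$ the exponent equals one, which is the root-n case expressed as a squared-error rate.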

This result is a consequence of Theorem 2 below, for which we now give
notation. Given $\alpha \in A$, $u, x \in \mathcal{R}_k$ and a function $b \in B_k$, let $i_1, \ldots, i_{|\alpha|}$ denote
the indices of the components of $\alpha$ that equal 1. Define $v_\alpha(u, x)$ to be the
$k$-vector with $u_{i_j}$ in position $i_j$ for $1 \le j \le |\alpha|$ and $x_j$ in position $j$ for each
$j$ that is not among $i_1, \ldots, i_{|\alpha|}$. Define the operator $M_\alpha$ by

$$(M_\alpha b)(x) = \int_0^1 \cdots \int_0^1 b\{v_\alpha(u, x)\}\, du_{i_1} \cdots du_{i_{|\alpha|}}. \tag{2.11}$$

Consider the functional that takes $g$ to the function of which the value at $x$
is

$$\int_{\mathcal{R}_k} \xi_\alpha(u, x)\, g^\alpha(u)\, du, \tag{2.12}$$

where $\xi_\alpha(u, x)$ is a function of the $2k$ variables among the components of $u$
and $x$. In Appendix A.2, we shall prove the following result.

Theorem 2. If $g \in \mathcal{G}_k$, $1 \le p \le k$, $0 \le \ell \le k$ and $1 \le p - \ell + 1 \le k$, then $g$
can be expressed as a linear form in integrals of the type (2.12), where $|\alpha| \ge \ell$,
all components of $\alpha$ that equal 1 are indexed in $V$ and
$\sup_{u, x \in \mathcal{R}_k} |\xi_\alpha(u, x)| \le C$, with $C > 0$ depending only on $k$, $\ell$ and $p$, and in
integrals $M_\beta g$, with $\beta \in A$ and $|\beta| \ge p - \ell + 1$.

Our derivation of Theorem 2 will provide an inductive argument for
calculating the representation of $g$ in any given case.

To appreciate how the convergence rate given three paragraphs above
follows from Theorem 2, let us consider the case $p = k$, for simplicity, and
express $g$ as indicated in the theorem: $g = g_1 + g_2$, where

$$g_1(x) = \sum_{i=1}^{r} \int_{\mathcal{R}_k} \xi_{\alpha(i)}(u, x)\, g^{\alpha(i)}(u)\, du, \qquad g_2(x) = \sum_{i=1}^{s} c_i\, (M_{\beta(i)} g)(x).$$

Here, $\sup |\xi_{\alpha(i)}(u, x)| \le \text{const.}$, the $c_i$'s are constants and
$\alpha(i), \beta(i) \in A$ with $|\alpha(i)| \ge \ell$ and $|\beta(i)| \ge k - \ell + 1$. Assuming, for
simplicity, that the design points are uniformly distributed, we may construct
the following root-n consistent estimator of $g_1(x)$ using the approach at (2.3):

$$\hat g_1(x) = \sum_{i=1}^{r} \frac{1}{n_{\alpha(i)}} \sum_{j=1}^{n_{\alpha(i)}} Y_{\alpha(i) j}\, \xi_{\alpha(i)}(X_{\alpha(i) j}, x).$$

We may estimate $g_2(x)$, using the method at (2.8), as follows:

[display defining $\hat g_2(x)$ not recoverable from the source]

Here, the starred design vector and $x^*$ denote the vectors of those
$k - |\beta(i)|$ components of the design point and of $x$, respectively, for which
the corresponding components of $\beta(i)$ are zero. Since $k - |\beta(i)| \le \ell - 1$ for
each $i$, the squared-error convergence rate of $\hat g_2$ to $g_2$ is $n^{-2d/(2d+\ell-1)}$.
Therefore, the squared-error convergence rate of $\hat g = \hat g_1 + \hat g_2$ to $g$ is
also $n^{-2d/(2d+\ell-1)}$, as claimed three paragraphs above.
Example 5. Suppose $k = 2$ and that noisy data are available for $g^{(1,1)}$
and $g^{(0,0)} = g$. Use the root-n consistent estimator of $\underline{g}_{11}$ from Example 2 to
write

$$y_{(0,0)} - \hat{\underline{g}}_{11}(x_1, x_2) = g(x_1, 0) + g(0, x_2) - g(0, 0) + O_p(n^{-1/2}) + \varepsilon_{(0,0)},$$

which is additively separable in $x_1$ and $x_2$ and hence estimable at
one-dimensional optimal rates.

Example 6. Suppose $k = 3$ and that noisy data are available for $g^{(1,1,1)}$,
$g^{(1,1,0)}$, $g^{(1,0,1)}$, $g^{(0,1,1)}$ and $g^{(0,0,0)} = g$. Define

$$\underline{g}_{111}(x_1, x_2, x_3) = \int_0^{x_1}\!\!\int_0^{x_2}\!\!\int_0^{x_3} g^{(1,1,1)}(u_1, u_2, u_3)\, du_1\, du_2\, du_3$$
$$= g(x_1, x_2, x_3) - g(x_1, x_2, 0) - g(x_1, 0, x_3) - g(0, x_2, x_3)
+ g(x_1, 0, 0) + g(0, x_2, 0) + g(0, 0, x_3) - g(0, 0, 0),$$

$$\underline{g}_{110}(x_1, x_2, \cdot) = \int_0^{x_1}\!\!\int_0^{x_2}\!\!\int_0^1 g^{(1,1,0)}(u_1, u_2, x_3)\, du_1\, du_2\, dx_3 - \int_0^1 \underline{g}_{111}(x_1, x_2, x_3)\, dx_3$$
$$= g(x_1, x_2, 0) - g(x_1, 0, 0) - g(0, x_2, 0) + g(0, 0, 0),$$

$$\underline{g}_{101}(x_1, \cdot, x_3) = \int_0^{x_1}\!\!\int_0^1\!\!\int_0^{x_3} g^{(1,0,1)}(u_1, x_2, u_3)\, du_1\, dx_2\, du_3 - \int_0^1 \underline{g}_{111}(x_1, x_2, x_3)\, dx_2$$
$$= g(x_1, 0, x_3) - g(x_1, 0, 0) - g(0, 0, x_3) + g(0, 0, 0),$$

$$\underline{g}_{011}(\cdot, x_2, x_3) = \int_0^1\!\!\int_0^{x_2}\!\!\int_0^{x_3} g^{(0,1,1)}(x_1, u_2, u_3)\, dx_1\, du_2\, du_3 - \int_0^1 \underline{g}_{111}(x_1, x_2, x_3)\, dx_1$$
$$= g(0, x_2, x_3) - g(0, x_2, 0) - g(0, 0, x_3) + g(0, 0, 0).$$

Sample analogues of all integral expressions may be calculated without local
averaging. Thus, $\underline{g}_{111}$ and its integrals can be estimated root-n consistently,
as can $\underline{g}_{110}$, $\underline{g}_{101}$ and $\underline{g}_{011}$. Now, write

$$y_{(0,0,0)} - \hat{\underline{g}}_{111}(x_1, x_2, x_3) - \hat{\underline{g}}_{110}(x_1, x_2, \cdot) - \hat{\underline{g}}_{101}(x_1, \cdot, x_3) - \hat{\underline{g}}_{011}(\cdot, x_2, x_3)$$
$$= g(x_1, 0, 0) + g(0, x_2, 0) + g(0, 0, x_3) - 2 g(0, 0, 0) + O_p(n^{-1/2}) + \varepsilon_{(0,0,0)},$$

which is additively separable in $x_1$, $x_2$ and $x_3$ and hence estimable at
one-dimensional optimal rates.
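The algebra behind Example 6 is easy to check numerically. The sketch below takes a hypothetical smooth test function `g` (not from the text), forms the four components from their closed-form expressions in terms of $g$, and confirms that what remains after subtracting them is the additively separable function $g(x_1,0,0) + g(0,x_2,0) + g(0,0,x_3) - 2g(0,0,0)$. It checks the noise-free identity only; the estimation error terms are not simulated.

```python
import math


def g(x1, x2, x3):
    """Hypothetical smooth test function on the unit cube."""
    return math.exp(x1 * x2 * x3) + x3 * math.sin(x1 + 2.0 * x2) + x1 * x2


def components(x1, x2, x3):
    """Closed forms of the four components defined in Example 6."""
    g000 = g(0.0, 0.0, 0.0)
    c111 = (g(x1, x2, x3) - g(x1, x2, 0.0) - g(x1, 0.0, x3) - g(0.0, x2, x3)
            + g(x1, 0.0, 0.0) + g(0.0, x2, 0.0) + g(0.0, 0.0, x3) - g000)
    c110 = g(x1, x2, 0.0) - g(x1, 0.0, 0.0) - g(0.0, x2, 0.0) + g000
    c101 = g(x1, 0.0, x3) - g(x1, 0.0, 0.0) - g(0.0, 0.0, x3) + g000
    c011 = g(0.0, x2, x3) - g(0.0, x2, 0.0) - g(0.0, 0.0, x3) + g000
    return c111, c110, c101, c011


def residual(x1, x2, x3):
    """g minus the four root-n estimable components."""
    return g(x1, x2, x3) - sum(components(x1, x2, x3))


def additive_part(x1, x2, x3):
    """The additively separable remainder claimed in Example 6."""
    return g(x1, 0.0, 0.0) + g(0.0, x2, 0.0) + g(0.0, 0.0, x3) - 2.0 * g(0.0, 0.0, 0.0)


point = (0.2, 0.7, 0.4)
print(residual(*point), additive_part(*point))  # the two values should agree
```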
3. Estimation.

3.1. Smoothing techniques. In Section 2, we gave examples of estimators
in the case where the design points $X_{\alpha i}$ are uniformly distributed on $\mathcal{R}_k$.
More generally, we should normalize the summands of our estimators, such
as those at (2.3) and (2.8), using estimators of the densities of the
distributions of design points. For simplicity, we shall develop the case of (2.8) in
this setting, noting that other cases are similar.

Suppose we observe the datasets at (2.7) for each $\beta \in B$, where the latter
is the set of $p$-vectors of zeros and ones with $1 \le p \le k$. Note that we may
also interpret $\beta$ as a $k$-vector, an element of $A$, in which each of the last
$k - p$ components is zero. Both interpretations will be made below.

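The design points $X_{\beta i}$, which are $k$-vectors, are assumed to be distributed
on $\mathcal{R}_k$ with density $f_\beta$, say. As in Section 2.2, let $X_{\beta i}^{[k-p]}$ and $x^{[k-p]}$ denote
the $(k-p)$-vectors consisting of the last $k - p$ components of $X_{\beta i}$ and $x$,
respectively, let $K$ be a $(k-p)$-variate kernel function, let $h$ denote a
bandwidth and redefine

$$(\widehat{\psi_\beta g^\beta})(x) = \frac{1}{n_\beta h^{k-p}} \sum_{i=1}^{n_\beta} \frac{Y_{\beta i}\, \chi_\beta(X_{\beta i}, x)}{\hat f_{\beta,-i}(X_{\beta i})}\, K\!\left(\frac{X_{\beta i}^{[k-p]} - x^{[k-p]}}{h}\right), \tag{3.1}$$

where $\hat f_{\beta,-i}$ denotes an estimator of $f_\beta$ computed from the dataset
$\mathcal{X}_{\beta,-i} = \{X_{\beta 1}, \ldots, X_{\beta n_\beta}\} \setminus \{X_{\beta i}\}$ obtained by dropping the $i$th
observation. Note that $\chi_\beta(X_{\beta i}, x)$ depends only on the first $p$ components of
$X_{\beta i}$ and $x$, whereas $\hat f_{\beta,-i}(x)$ and $f_\beta(x)$ depend nondegenerately on all $k$
components of $x$.

A degree of interest centers on the definition adopted for $\hat f_{\beta,-i}$. We shall
discuss an edge-corrected kernel method, but, of course, there are many
other techniques that can be used, for example, polynomial interpolation,
or polynomial smoothing, applied to binned data.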

Let $H > 0$ denote a bandwidth and let $L_1$ represent a bounded function
of a real variable $t$, supported on the interval $[-1, 1]$ and satisfying
$\int t^j L_1(t)\, dt = \delta_{j0}$ (the Kronecker delta), for $0 \le j \le d_1 - 1$. (The positive
integer $d_1$ may differ from the order $d$ of the kernel $K$.) Construct a $k$-variate
product kernel $L$,

$$L(u_1, \ldots, u_k) = L_1(u_1) \cdots L_1(u_k). \tag{3.2}$$

The density estimator

$$\hat f_{\beta,-i}(x) = \frac{1}{(n_\beta - 1) H^k} \sum_{j \,:\, j \ne i} L\!\left(\frac{x - X_{\beta j}}{H}\right) \tag{3.3}$$

does not suffer edge effects provided $x_i \in [h, 1 - h]$ for $1 \le i \le k$. However, if
for one or more values of $i$, $x_i$ lies outside $[h, 1 - h]$ and, more particularly,
if $0 \le x_i < h$, then edge effects may be averted by replacing $L_1(u_i)$ with
$L_{\mathrm{edge}}(u_i)$ in the definition of $L$ at (3.2). Here, $L_{\mathrm{edge}}$ is a bounded, univariate
edge kernel, supported on $[0, 1]$ and satisfying $\int t^j L_{\mathrm{edge}}(t)\, dt = \delta_{j0}$ for
$0 \le j \le d_1 - 1$. Similarly, if $1 - h < x_i \le 1$, then we would replace $L_1$, applied to
the $i$th component in (3.2), by an edge kernel supported on $[-1, 0]$.

With these modifications, the density estimator $\hat f_{\beta,-i}$ defined at (3.3) is
of $d_1$th order and does not suffer edge effects in $\mathcal{R}_k$.
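A leave-one-out product-kernel density estimator of the kind described at (3.2) and (3.3) can be sketched as follows. Several choices here are assumptions rather than the authors' specification: the normalization by $(n_\beta - 1)H^k$ is the standard one and is assumed, the univariate kernel is the second-order Epanechnikov kernel (so $d_1 = 2$, lower than the theory requires when $k \ge 2$), no edge kernels are implemented, and the name `loo_density` is hypothetical.

```python
import random


def epanechnikov(t):
    """Second-order univariate kernel L_1 supported on [-1, 1]."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0


def loo_density(x, points, i, H):
    """Leave-one-out product-kernel density estimate at x, omitting points[i]."""
    k = len(x)
    total = 0.0
    for j, point in enumerate(points):
        if j == i:
            continue  # drop the ith observation
        weight = 1.0
        for xa, pa in zip(x, point):
            weight *= epanechnikov((xa - pa) / H)  # product kernel as in (3.2)
        total += weight
    return total / ((len(points) - 1) * H ** k)


rng = random.Random(0)
sample = [(rng.random(), rng.random()) for _ in range(5000)]
# For a uniform design the true density is 1 at every interior point.
print(loo_density((0.5, 0.5), sample, 0, 0.2))
```

In an estimator such as (3.1), this function would be evaluated at `x = points[i]` for each `i`, which costs a number of operations quadratic in the sample size as written.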

Our definition of $\hat f_{\beta,-i}$ ensures that the estimator at (3.1) is protected
from edge effects in the first $p$ coordinates of $x$. However, we should modify
$K$ in the same way as we did $L$; otherwise, $\widehat{\psi_\beta g^\beta}$ will suffer from edge
effects in the last $k - p$ coordinates of $x$. We shall assume that this has been
done so that the $(k-p)$-variate kernel $K$ is, analogously to $L$, a product of
$k - p$ bounded, compactly supported, $d$th order univariate kernels that are
switched to appropriate edge kernels if one or more components of $x^{[k-p]}$ are
within $h$ of the boundary. The univariate kernels, $K_1$ and $K_{\mathrm{edge}}$, say, will
each be taken to be of $d$th order.

Rather than employ special kernels to overcome edge effects, we may
use local polynomial methods to construct $\widehat{\psi_\beta g^\beta}$, obtaining an alternative
estimator to that at (3.1). In this approach, we would run a $(k-p)$-variate
local polynomial smoother of degree $d - 1$ through the data pairs

$$\bigl(X_{\beta i}^{[k-p]},\; Y_{\beta i}\, \chi_\beta(X_{\beta i}, x) / \hat f_{\beta,-i}(X_{\beta i})\bigr), \qquad 1 \le i \le n_\beta. \tag{3.4}$$

This technique is also able to correct for a nonuniform joint distribution of
the last $k - p$ components, so we could normalize the "response variable"
a little differently than by dividing by $\hat f_{\beta,-i}(X_{\beta i})$, as at (3.4). However, the
normalization at (3.4) causes no problems for the local polynomial smoother.

3.2. Limit theory for estimators. For the sake of simplicity, we shall give
theory only for edge-corrected kernel approaches to estimation. In particular,
we assume $\hat f_{\beta,-i}$ is constructed using the methods described in
Section 3.1, that the univariate kernel $L_1$ and its two edge-correcting forms
$L_{\mathrm{edge}}$ are bounded and compactly supported and that the same is true of
the univariate kernels $K_1$ and $K_{\mathrm{edge}}$ that are multiplied together to give the
$(k-p)$-variate kernel $K$. To this, we add the assumption that

(3.5) $K_1$, $K_{\mathrm{edge}}$, $L_1$ and $L_{\mathrm{edge}}$ are Hölder continuous as functions on
the real line.

Recall that the estimator $\hat f_{\beta,-i}$ is constructed using a $d_1$th order kernel
$L$ and a bandwidth $H$ and that the kernel $K$ used in (3.1) is of order $d$
and employs a bandwidth $h$. Of these quantities, we assume the following
conditions:

$$d > 2(k + p) \quad \text{and} \quad d_1 > k, \tag{3.6}$$

and, for constants $0 < C_1 < C_2 < \infty$ and $\eta > 0$,

(3.7) [display garbled in the source: two-sided bounds on the bandwidths $h$
and $H$, stated in terms of $n_\beta$, $C_1$, $C_2$ and $\eta$]

for all sufficiently large $n_\beta$.



Provided (3.6) holds, we may choose $h$ and $H$ satisfying (3.7). We also
suppose that

(3.8) $g^\beta$ is bounded, the last $k - p$ components of $g$ have $d$ continuous
derivatives and $f_\beta$ has $d_1$ bounded derivatives and is bounded away
from zero on $\mathcal{R}_k$.

We also make the following basic "structural" assumptions:

(3.9) data pairs $(X_{\beta i}, Y_{\beta i})$ are generated by the model at (2.7), in which the
design variables $X_{\beta i}$ are independent and identically distributed on
$\mathcal{R}_k$ with density $f_\beta$, the errors $\varepsilon_{\beta i}$ are independent and identically
distributed with zero mean and the errors are independent of the
design points.

From these data, construct the estimator $\widehat{\psi_\beta g^\beta}$ defined at (3.1). Recall that
$u^{[k-p]}$ denotes the $(k-p)$-vector consisting of the last $k - p$ components of
the $k$-vector $u$. Let $w(u, x \mid h)$ represent the $k$-vector with $u_j$ in position $j$
for $1 \le j \le p$ and $x_j + h u_j$ in position $j$ for $p + 1 \le j \le k$.

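Theorem 3. Assume $1 \le p \le k$, that conditions (3.5)-(3.9) hold and
that the distribution of the errors $\varepsilon_{\beta i}$ has zero mean and all moments finite.
Then

$$(\widehat{\psi_\beta g^\beta})(x) = \int g^\beta\{w(u, x \mid h)\}\, \chi_\beta\{w(u, x \mid h), x\}\, K(u^{[k-p]})\, du$$
$$\qquad + \frac{1}{n_\beta} \sum_{i=1}^{n_\beta} \frac{\varepsilon_{\beta i}\, \chi_\beta(X_{\beta i}, x)}{f_\beta(X_{\beta i})}\, \frac{1}{h^{k-p}}\, K\!\left(\frac{X_{\beta i}^{[k-p]} - x^{[k-p]}}{h}\right) + o_p\{(n_\beta h^{k-p})^{-1/2}\}, \tag{3.10}$$

uniformly in $x \in \mathcal{R}_k$.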

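We shall discuss the implications of Theorem 3 in the two main cases,
$p = k$ and $p < k$. In the first setting, the contribution of the kernel $K$ to (3.10) is
degenerate and the integral on the right-hand side is identical to $(\psi_\beta g^\beta)(x)$.
(Here, $\beta$ is a $k$-vector.) Therefore, when $p = k$, (3.10) is equivalent to

$$(\widehat{\psi_\beta g^\beta})(x) = (\psi_\beta g^\beta)(x) + Z_{n\beta}(x) + o_p(n^{-1/2}), \tag{3.11}$$

uniformly in $x \in \mathcal{R}_k$, where

$$Z_{n\beta}(x) = \frac{1}{n_\beta} \sum_{i=1}^{n_\beta} \frac{\varepsilon_{\beta i}\, \chi_\beta(X_{\beta i}, x)}{f_\beta(X_{\beta i})}$$

is a zero-mean stochastic process defined on $\mathcal{R}_k$. As $n_\beta$ increases, $n_\beta^{1/2} Z_{n\beta}$
converges weakly to the Gaussian process $Z_0$, say, with zero mean and
covariance function

$$\operatorname{cov}\{Z_0(x_1), Z_0(x_2)\} = \sigma_\beta^2 \int \chi_\beta(u, x_1)\, \chi_\beta(u, x_2)\, f_\beta(u)^{-1}\, du, \tag{3.12}$$

where $\sigma_\beta^2 = \operatorname{var}(\varepsilon_{\beta i})$. This property and (3.11) together imply that
$\widehat{\psi_\beta g^\beta}$ converges uniformly to $\psi_\beta g^\beta$ at rate $n^{-1/2}$:

$$\sup_{x \in \mathcal{R}_k} \bigl|(\widehat{\psi_\beta g^\beta})(x) - (\psi_\beta g^\beta)(x)\bigr| = O_p(n^{-1/2}).$$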

Next, we treat the case p < k. Although Xp( u -> x ) is discontinuous as a 
function of the first p components of u, if g has d continuous derivatives of 
its last k — p components, then so too does Xb{'i x )'i see the definition of Xa 
given in Appendix A.l and recall that definition has a minor adaptation to 
the case of X/3- Therefore, standard Taylor expansion methods may be used 
to prove that for a continuous function a, 



g{w(u, x | h)}xp{w(u, x \ h),x}K(v} k p ^)du 

g{w(u, x),x}xp{w(u, x), x] du\ ■ ■ ■ du p + h d a(x) + o(h d ) 



(3.13) 



as $h \to 0$, where $w(u, x) = w(u, x \mid 0)$ is the $k$-vector with $u_j$ in position $j$ for $1 \le j \le p$ and $x_j$ in position $j$ for $p + 1 \le j \le k$. The series on the right-hand side of (3.10) is asymptotically normally distributed with zero mean and variance $(n_\beta h^{k-p})^{-1}\sigma_\beta^2\,\tau(x)^2\,\kappa$, where

$\tau(x)^2 = \int \chi_\beta\{w(u, x), x\}^2\,f_\beta\{w(u, x)\}^{-1}\,du_1 \cdots du_p$

and $\kappa = \int K^2$. This result, (3.10) and (3.13) collectively imply that for the choice of $h$ given in (3.7), $(\widehat{\psi_\beta g^\beta})(x)$ converges to $(\psi_\beta g^\beta)(x)$ at the pointwise squared-error rate $n_\beta^{-2d/(2d+k-p)}$, as claimed in Section 2.2. The uniform convergence rate is slower only by a logarithmic factor.
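To make the dimension-reduction effect concrete, the short sketch below tabulates the squared-error exponent $2d/(2d + k - p)$ for a function of $k = 2$ variables with $d = 2$ derivatives. The helper name `mse_rate_exponent` is ours and purely illustrative. The value $p = 0$ is included only for comparison, since it gives the familiar rate $n^{-2d/(2d+k)}$ of conventional $k$-dimensional smoothing.

```python
from fractions import Fraction

def mse_rate_exponent(d: int, k: int, p: int) -> Fraction:
    """Return r such that the pointwise squared error is of order n**(-r).

    d: number of derivatives of g, k: number of covariates,
    p: number of coordinates handled by nonlocal averaging.
    """
    if d <= 0 or not 0 <= p <= k:
        raise ValueError("need d > 0 and 0 <= p <= k")
    return Fraction(2 * d, 2 * d + k - p)

for p in range(3):
    # exponents for d = 2, k = 2 and p = 0, 1, 2
    print(p, mse_rate_exponent(d=2, k=2, p=p))
```

With $p = k$ the exponent equals 1, which agrees with the $n_\beta^{-1/2}$ rate in (3.11). With $p = 1$ and $k = 2$ the exponent is $4/5$, the rate for a function of a single nonparametric variable.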

It is straightforward to prove that the pointwise rate is minimax optimal. Indeed, that property follows from conventional minimaxity results in nonparametric regression on taking $g$ to be a function whose dependence on the first $p$ coordinates is degenerate. Likewise, the uniform convergence rate can be shown to be optimal, provided we use a slightly larger bandwidth $h$, increased by a logarithmic factor relative to that asserted in (3.7).

We close by formally stating the main results discussed above.

Corollary 2. Assume the conditions of Theorem 3. If $p = k$, then $n_\beta^{1/2}\{(\widehat{\psi_\beta g^\beta})(x) - (\psi_\beta g^\beta)(x)\}$, viewed as a stochastic process indexed by $x \in \mathcal{R}_k$, converges weakly, as $n_\beta \to \infty$, to a zero-mean Gaussian process $Z_0$ with covariance function given at (3.12). If $p < k$ and if $h \sim \text{const} \cdot n_\beta^{-1/(2d+k-p)}$, then for each $x \in \mathcal{R}_k$, $n_\beta^{d/(2d+k-p)}\{(\widehat{\psi_\beta g^\beta})(x) - (\psi_\beta g^\beta)(x)\}$ is asymptotically normally distributed with finite mean and variance.

Of course, in order to construct an estimator $\hat g$ of $g$, we must add $\widehat{\psi_\beta g^\beta}$ over all $\beta$; see (2.9). The resulting limit theory for $\hat g$ is the superposition of that for each $\widehat{\psi_\beta g^\beta}$. However, provided the sets of design points $X_{\beta i}$ and errors $\epsilon_{\beta i}$ are independent for different $\beta$'s, properties of the superposition are readily derived from the results that we have already obtained for a single $\beta$.

Indeed, under this assumption of row-wise independence, it follows directly from Corollary 2 that if, for a sequence of integers $n$ diverging to infinity, $n_\beta/n$ converges to a strictly positive constant $c_\beta$ for each $\beta \in B$, then (a) if $p = k$, $n^{1/2}(\hat g - g)$ converges weakly to a zero-mean Gaussian process defined on $\mathcal{R}_k$ and (b) if $p < k$, then for each $x \in \mathcal{R}_k$, $n^{d/(2d+k-p)}\{\hat g(x) - g(x)\}$ is asymptotically normally distributed with finite mean and variance.

Correlation among residuals in different equations can also be accommodated. Let $B = \{\beta_1, \ldots, \beta_s\}$. Suppose $(X_i, Y_{\beta_1 i}, \ldots, Y_{\beta_s i})$, $i = 1, \ldots, n$, are independent and identically distributed, where $Y_{\beta_j i} = g^{\beta_j}(X_i) + \epsilon_{\beta_j i}$, $j = 1, \ldots, s$, and $\sigma_{jj'} = \operatorname{cov}(\epsilon_{\beta_j i}, \epsilon_{\beta_{j'} i})$. Let $f(x)$ denote the design density of the $X_i$, which are distributed independently of the residuals. Then conclusions (a) and (b) of the previous paragraph continue to hold with the covariance function of the limiting Gaussian process in (a), say $Z_0$, given by

$\operatorname{cov}\{Z_0(x_1), Z_0(x_2)\} = \sum_{j=1}^{s}\sum_{j'=1}^{s} \sigma_{jj'} \int \chi_{\beta_j}(u, x_1)\,\chi_{\beta_{j'}}(u, x_2)\,f(u)^{-1}\,du$.

4. Numerical results.

4.1. Simulation of cost function and input factor estimation. We return to the cost function estimation problem discussed in Section 1. Since doubling of input prices at a given level of output doubles costs, the cost function is homogeneous of degree one in input prices. Thus, we may write average costs, that is, costs per unit of output $Q$, as $AC = r\,g(Q, w/r)$, where $r$ and $w$ are the prices of capital and labor, respectively. Applying Shephard's Lemma yields the average labor function, $AL = \partial AC/\partial w = \partial g(Q, w/r)/\partial(w/r)$. If noisy data are available for $AC$ and $AL$, then this application is analogous to Example 3 above, except that the nonparametric function $g$ is multiplied by $r$, a feature which arises from the degree-one homogeneity of cost functions in their factor prices.

We calibrate our simulations using the Cobb-Douglas production function $Q = cK^{c_1}L^{c_2}$ (see, e.g., [23]). The data-generating mechanism for average costs is

(4.1) $y_{(0,0)} = AC + \epsilon_{(0,0)} = r\,\tilde c\,Q^{(1-c_1-c_2)/(c_1+c_2)}\,(w/r)^{c_2/(c_1+c_2)} + \epsilon_{(0,0)}$,

where $\tilde c = \{(c_1/c_2)^{c_2/(c_1+c_2)} + (c_1/c_2)^{-c_1/(c_1+c_2)}\}\,c^{-1/(c_1+c_2)}$. For average labor, we use

(4.2) $y_{(0,1)} = AL + \epsilon_{(0,1)} = \frac{\tilde c\,c_2}{c_1+c_2}\,Q^{(1-c_1-c_2)/(c_1+c_2)}\,(w/r)^{-c_1/(c_1+c_2)} + \epsilon_{(0,1)}$.

In the simulations below, we set $c_1 = 0.8$ and $c_2 = 0.7$. Data for $Q$ and for the ratio of factor prices $w/r$ are generated from independent uniform distributions on $[0.5, 1.5]$. We assume that $\epsilon_{(0,0)}$ and $\epsilon_{(0,1)}$ are normal residuals with zero means, standard deviations 0.35 and correlation $\rho$ set to 0.0, 0.4 or 0.9. The $R^2$ is approximately 0.75 for the AC equation and 0.15 for the AL equation. Our reference estimator of average costs consists of applying bivariate kernel smoothing to the triples $(y_{(0,0)}/r, Q, w/r)$ to obtain $\hat g(Q, w/r)$, which is then multiplied by $r$.

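To incorporate the labor data, define

(4.3) $\hat g_a(r, Q, w/r) = r\,\frac{1}{nh} \sum_{j:\ |Q_j - Q| < h/2,\ w_j/r_j < w/r} Y_{(0,1)j}$,

(4.4) $\hat g_b(r, Q) = r\,\frac{1}{nh} \sum_{j:\ |Q_j - Q| < h/2} \{Y_{(0,0)j}/r_j - \hat g_a(1, Q_j, w_j/r_j)\}$.

Then $\widehat{AC} = \hat g_a + \hat g_b$. Table 1 summarizes our results for various sample sizes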
n and residual correlations. There, we report the mean squared errors of this 
estimator relative to the bivariate kernel estimator described above. There 
are substantial efficiency gains, which increase with sample size, as would be 
expected given the faster convergence rates of derivative-based estimators. 
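The estimator defined by (4.3) and (4.4) can be sketched in a few lines. The code below is an illustration under stated assumptions, not the code used for Table 1: the text does not specify the production constant $c$ or the distribution of $r$, so we set $c = 1$ and $r = 1$ for every observation; the bandwidth $h = 0.2$ and the interior evaluation point are arbitrary choices; and the normalization by $nh$ relies on $Q$ and $w/r$ having unit density on $[0.5, 1.5]$. All function names are ours.

```python
import numpy as np

C1, C2 = 0.8, 0.7
S = C1 + C2

def c_tilde(c=1.0):
    # constant in (4.1); the production constant c = 1 is an assumption
    return ((C1 / C2) ** (C2 / S) + (C1 / C2) ** (-C1 / S)) * c ** (-1.0 / S)

def g_true(Q, t):
    # g(Q, w/r) = AC / r from (4.1), writing t for w/r
    return c_tilde() * Q ** ((1.0 - S) / S) * t ** (C2 / S)

def simulate(n, rho, rng, sd=0.35):
    Q = rng.uniform(0.5, 1.5, n)
    t = rng.uniform(0.5, 1.5, n)              # ratio of factor prices w/r
    r = np.ones(n)                            # assumption: capital price fixed at 1
    cov = sd**2 * np.array([[1.0, rho], [rho, 1.0]])
    e = rng.multivariate_normal(np.zeros(2), cov, size=n)
    y_ac = r * g_true(Q, t) + e[:, 0]         # (4.1)
    y_al = c_tilde() * (C2 / S) * Q ** ((1.0 - S) / S) * t ** (-C1 / S) + e[:, 1]  # (4.2)
    return Q, t, r, y_ac, y_al

def g_a(Q0, t0, Q, t, y_al, h):
    # (4.3) with r = 1: local in Q, nonlocal (cumulative) in w/r
    keep = (np.abs(Q - Q0) < h / 2) & (t < t0)
    return y_al[keep].sum() / (len(Q) * h)

def ac_hat(r0, Q0, t0, Q, t, r, y_ac, y_al, h):
    near = np.flatnonzero(np.abs(Q - Q0) < h / 2)
    inner = np.array([g_a(Q[j], t[j], Q, t, y_al, h) for j in near])
    g_b = (y_ac[near] / r[near] - inner).sum() / (len(Q) * h)   # (4.4) with r = 1
    return r0 * (g_a(Q0, t0, Q, t, y_al, h) + g_b)

rng = np.random.default_rng(0)
Q, t, r, y_ac, y_al = simulate(5000, 0.4, rng)
# estimate and true value of AC at r = 1, Q = 1, w/r = 1
print(ac_hat(1.0, 1.0, 1.0, Q, t, r, y_ac, y_al, h=0.2), g_true(1.0, 1.0))
```

Smoothing is applied only in the $Q$ direction. In the $w/r$ direction the labor data are summed over all observations below the target price ratio, which is the nonlocal averaging that produces the dimension reduction.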



Table 1
MSEs of derivative-based AC estimator relative to bivariate kernel smoothing

   n     ρ = 0.0    ρ = 0.4    ρ = 0.9
  100     0.384      0.384      0.374
  200     0.277      0.274      0.272
  500     0.233      0.230      0.228
 1000     0.185      0.187      0.186

4.2. Estimating costs of electricity distribution. To further illustrate the procedure, we use data on 81 electricity distributors in Ontario. (For additional details, see [24].) We have data on output, Q, which is the number of customers served and which varies from about 500 to over 200,000. Average labor, AL, equals the number of employees divided by Q. In addition, we have data on hourly wages, w, and the cost of capital, r.

Figure 1 illustrates the estimated average cost function using only AC data and a bivariate loess smoother available in S-PLUS. Next, we use both the AC and AL data and apply equations (4.3) and (4.4), suitably modified for the nonuniform distribution of w/r. Figure 2 illustrates the resulting estimate.

Fig. 1. Function estimate using data on function only.

Fig. 2. Function estimate using data on function and partial derivative.



APPENDIX: TECHNICAL ARGUMENT 
A.1. Proof of Theorem 1. It is readily seen that when $k = 1$,

(A.1) $g(x) = \sum_{j=0}^{1} \int_0^1 \chi_j(u, x)\,g^{(j)}(u)\,du$,

where $\chi_0(u, x) = 1$, $\chi_1(u, x) = u - 1 + I(u \le x)$, $I(u \le x) = 1$ if $u \le x$ and equals 0 otherwise and $g^{(j)}(x) = (d/dx)^j g(x)$. Repeating identity (A.1) for each component of a function $g$ of $k > 1$ variables, we deduce that Theorem 1 holds with $\psi_\alpha$ defined by $(\psi_\alpha g)(x) = \int \chi_\alpha(u, x)\,g(u)\,du$, where

$\chi_\alpha(u_1, \ldots, u_k, x_1, \ldots, x_k) = \prod_{j=1}^{k} \chi_{\alpha_j}(u_j, x_j)$

and $\alpha = (\alpha_1, \ldots, \alpha_k)$. Note, particularly, that $|\chi_\alpha| \le 1$ and so $\psi_\alpha \in \mathcal{K}(1)$, where $\mathcal{K}(C)$ is defined as in Section 2.1.
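Identity (A.1) is easy to check numerically. The sketch below evaluates both integrals on $[0, 1]$ with a midpoint rule and compares the result with $g(x)$; the test function, the grid size `m` and the helper names `chi1` and `reconstruct` are our own illustrative choices.

```python
import numpy as np

def chi1(u, x):
    # chi_1(u, x) = u - 1 + I(u <= x)
    return u - 1.0 + (u <= x)

def reconstruct(g, dg, x, m=200_000):
    """Right-hand side of (A.1), computed by a midpoint rule on [0, 1]."""
    u = (np.arange(m) + 0.5) / m                 # midpoints of m equal cells
    return np.mean(g(u)) + np.mean(chi1(u, x) * dg(u))

# hypothetical test function and its exact derivative
g = lambda u: np.exp(u) * np.cos(3.0 * u)
dg = lambda u: np.exp(u) * (np.cos(3.0 * u) - 3.0 * np.sin(3.0 * u))

for x in (0.0, 0.25, 0.9, 1.0):
    # the two printed values should agree up to quadrature error
    print(x, reconstruct(g, dg, x), g(x))
```

The first mean approximates $\int_0^1 g(u)\,du$ (the term with $\chi_0 = 1$) and the second approximates $\int_0^1 \chi_1(u, x)\,g'(u)\,du$. Both are averages over the whole interval, which is why no localization at $x$ is required.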

A.2. Proof of Theorem 2. In proving the theorem, we may assume that $V = \{1, \ldots, k\}$, since the contrary case can be treated by fixing components whose index does not lie in $V$. In the notation at (2.11), define



(N a b)(x) = I ■■■ [ H bfya^x^du^--- dui 
Given $\alpha \in A$, let $A(\alpha)$ denote the set of vectors $\beta = (\beta_1, \ldots, \beta_k) \in A$ for which each index $j$ with $\beta_j = 1$ is also an index with $\alpha_j = 1$. Put $\alpha_0 = (0, \ldots, 0)$ and $A_1(\alpha) = A(\alpha) \setminus \{\alpha_0\}$. We shall prove shortly that for all $\alpha \in A$ and $b \in B_k$,

(A.2) $\sum_{\beta \in A(\alpha)} (-1)^{|\beta|} M_\beta N_\alpha b^\alpha = \sum_{\beta \in A(\alpha)} (-1)^{|\beta|} M_\beta b$

or, equivalently,

(A.3) $b = \sum_{\beta \in A(\alpha)} (-1)^{|\beta|} M_\beta N_\alpha b^\alpha - \sum_{\beta \in A_1(\alpha)} (-1)^{|\beta|} M_\beta b$.

Substituting $b = g$ and $\alpha = (1, \ldots, 1)$ into (A.3), we obtain

(A.4) $g = \sum_{\beta \in A} (-1)^{|\beta|} M_\beta N_\alpha g^\alpha - \sum_{\beta \in A_1} (-1)^{|\beta|} M_\beta g$,

where $A_1 = A_1(1, \ldots, 1) = A \setminus \{\alpha_0\}$. The first series on the right-hand side is a linear expression in integrals of the form at (2.12). If $|\beta| \ge k - \ell + 1$, then $M_\beta g$ is also of the form claimed in the theorem. It remains only to treat terms $M_\beta g$ with $|\beta| \le k - \ell$, which we do using an iterative argument. [Note that, since $\beta \in A_1(\alpha)$, we have $|\beta| \ge 1$, so we have already finished if $\ell = k$.] Write $S(\beta)$ for the set of indices $i$ such that $\beta_i = 1$ and define $\alpha^1 = \alpha^1(\beta) \in A$ by $S(\alpha^1) = S(\alpha) \setminus S(\beta)$. Apply (A.3) again, this time with $\alpha = \alpha^1$ and $b = M_\beta g$, obtaining

$M_\beta g = \sum_{\beta^1 \in A(\alpha^1)} (-1)^{|\beta^1|} M_{\beta^1} N_{\alpha^1} (M_\beta g)^{\alpha^1} - \sum_{\beta^1 \in A_1(\alpha^1)} (-1)^{|\beta^1|} M_{\beta^1} M_\beta g$.

By definition of $\alpha^1$, $(M_\beta g)^{\alpha^1} = M_\beta(g^{\alpha^1})$, and so $N_{\alpha^1}(M_\beta g)^{\alpha^1} = (N_{\alpha^1} M_\beta)(g^{\alpha^1})$, which is a $k$-fold integral of $g^{\alpha^1}$, where $|\alpha^1| = k - |\beta| \ge k - (k - \ell) = \ell$. Also, $M_{\beta^1} M_\beta g = M_{\beta^2} g$, where $\beta^2 \in A$ and $|\beta^2| \ge 2$. (The superscript 2 is an index, not an exponent.) If $|\beta^2| \ge k - \ell + 1$, we are done; if $|\beta^2| \le k - \ell$, we continue the process of iteration.

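Finally, we derive (A.2). Again, it suffices to treat the case $\alpha = (1, \ldots, 1)$, since other contexts may be addressed by fixing components $x_j$ for $j$ such that $\alpha_j = 0$. In the case $\alpha = (1, \ldots, 1)$,

$(N_\alpha b^\alpha)(x) = \int_0^{x_1} \cdots \int_0^{x_k} b^{(1, \ldots, 1)}(u_1, \ldots, u_k)\,du_1 \cdots du_k = \sum_{\gamma \in A} (-1)^{|\gamma|}\,b\{v_\gamma(0, x)\}$,

whence

(A.5) $\sum_{\beta \in A} (-1)^{|\beta|} (M_\beta N_\alpha b^\alpha)(x) = \sum_{\gamma \in A} (-1)^{|\gamma|} \sum_{\beta \in A} (-1)^{|\beta|}\,[M_\beta b\{v_\gamma(0, \cdot)\}](x)$.

If $\gamma_i = 1$ and $\beta = (\beta_1, \ldots, \beta_k) \in A$ then, if we switch $\beta_i$ from 0 to 1, we do not alter the value of $[M_\beta b\{v_\gamma(0, \cdot)\}](x)$. Therefore, by virtue of the factor $(-1)^{|\beta|}$ below,

$\sum_{\beta \in A} (-1)^{|\beta|}\,[M_\beta b\{v_\gamma(0, \cdot)\}](x) = 0$

unless $\gamma = \alpha_0$. However, $v_{\alpha_0}(0, u) = u$, so by (A.5),

$\sum_{\beta \in A} (-1)^{|\beta|} (M_\beta N_\alpha b^\alpha)(x) = \sum_{\beta \in A} (-1)^{|\beta|} (M_\beta b)(x)$,

which, in the case $\alpha = (1, \ldots, 1)$, is equivalent to (A.2).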

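A.3. Proof of Theorem 3. The estimators $\hat f_{\beta,-i}$ have biases and variances that are uniformly of orders $H^{d_1}$ and $(n_\beta H^k)^{-1}$, respectively, and, in particular,

(A.6) $\sup_{x \in \mathcal{R}_k,\ 1 \le i \le n_\beta} |E\{\hat f_{\beta,-i}(x)\} - f_\beta(x)| = O(H^{d_1})$.

Arguments based on Markov's inequality show that for each $c, C > 0$,

(A.7) $\sup_{x \in \mathcal{R}_k,\ 1 \le i \le n_\beta} P\{|\hat f_{\beta,-i}(x) - E\hat f_{\beta,-i}(x)| > (n_\beta^{c-1} H^{-k})^{1/2}\} = O(n_\beta^{-C})$.

The Hölder continuity assumed of $L$ may be used to prove that if $C_1 > 0$ is chosen sufficiently large, then for all $C_2 > 0$,

$E\{\sup |\hat f_{\beta,-i}(x_1) - \hat f_{\beta,-i}(x_2)|^{C_2}\} = O(n_\beta^{-C_2})$,

the supremum being taken over $1 \le i \le n_\beta$ and over $x_1, x_2 \in \mathcal{R}_k$ with $\|x_1 - x_2\| \le n_\beta^{-C_1}$. Therefore, again by Markov's inequality and for each $c, C > 0$,

(A.8) $P\{\sup |\hat f_{\beta,-i}(x_1) - \hat f_{\beta,-i}(x_2)| > n_\beta^{c-1}\} = O(n_\beta^{-C})$,

with the supremum over the same set. Applying (A.7) on a lattice of values $x \in \mathcal{R}_k$ of edge width $n_\beta^{-C_1}$ and using (A.8) to bound $|\hat f_{\beta,-i}(x_1) - \hat f_{\beta,-i}(x_2)|$ when $x_1$ is off the lattice and $x_2$ is the nearest grid point to $x_1$, we may prove that for each $c, C > 0$,

(A.9) $P\{\sup_{x \in \mathcal{R}_k,\ 1 \le i \le n_\beta} |\hat f_{\beta,-i}(x) - E\hat f_{\beta,-i}(x)| > (n_\beta^{c-1} H^{-k})^{1/2}\} = O(n_\beta^{-C})$.

Below, we shall refer to this as the "lattice argument"; it employs the Hölder-continuity condition (3.5).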

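Taylor expanding as $\hat f_{\beta,-i}^{-1} = f_\beta^{-1} - (\hat f_{\beta,-i} - f_\beta) f_\beta^{-2} + \cdots$, we may show that

(A.10) $\hat f_{\beta,-i}(X_{\beta i})^{-1} - f_\beta(X_{\beta i})^{-1} = -\frac{\hat f_{\beta,-i}(X_{\beta i}) - f_\beta(X_{\beta i})}{f_\beta(X_{\beta i})^2} + \Delta_{\beta i}$,

where, by (A.6) and (A.9), we have for each $c > 0$,

(A.11) $\max_{1 \le i \le n_\beta} |\Delta_{\beta i}| = O_p(n_\beta^{c-1} H^{-k} + H^{2d_1})$.

Substituting (A.10) into the definition (3.1) of the estimator $\widehat{\psi_\beta g^\beta}$, we deduce that

(A.12) $(\widehat{\psi_\beta g^\beta})(x) = S_1(x) - S_2(x) - S_3(x) - S_4(x) + S_5(x)$,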
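where

$S_1(x) = \frac{1}{n_\beta} \sum_{i=1}^{n_\beta} \frac{Y_{\beta i}\,\chi_\beta(X_{\beta i}, x)}{f_\beta(X_{\beta i})}\,K_{\beta i}(x)$,

$S_2(x) = \frac{1}{n_\beta} \sum_{i=1}^{n_\beta} \frac{g(X_{\beta i})\,\chi_\beta(X_{\beta i}, x)\,\{\hat f_{\beta,-i}(X_{\beta i}) - \kappa_\beta(X_{\beta i})\}}{f_\beta(X_{\beta i})^2}\,K_{\beta i}(x)$,

$S_3(x) = \frac{1}{n_\beta} \sum_{i=1}^{n_\beta} \frac{\epsilon_{\beta i}\,\chi_\beta(X_{\beta i}, x)\,\{\hat f_{\beta,-i}(X_{\beta i}) - \kappa_\beta(X_{\beta i})\}}{f_\beta(X_{\beta i})^2}\,K_{\beta i}(x)$,

$S_4(x) = \frac{1}{n_\beta} \sum_{i=1}^{n_\beta} \frac{Y_{\beta i}\,\chi_\beta(X_{\beta i}, x)\,\{\kappa_\beta(X_{\beta i}) - f_\beta(X_{\beta i})\}}{f_\beta(X_{\beta i})^2}\,K_{\beta i}(x)$,

$S_5(x) = \frac{1}{n_\beta} \sum_{i=1}^{n_\beta} Y_{\beta i}\,\chi_\beta(X_{\beta i}, x)\,\Delta_{\beta i}\,K_{\beta i}(x)$,

$K_{\beta i}(x) = h^{-(k-p)} K\{(X_{\beta i}^{[k-p]} - x^{[k-p]})/h\}$ and $\kappa_\beta(x) = E\{\hat f_{\beta,-i}(x)\}$.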

Noting that the errors $\epsilon_{\beta i}$ are independent of the design points $X_{\beta i}$, it may be shown using moment methods that for $\ell = 3$,

(A.13) $\sup_{x \in \mathcal{R}_k} |S_\ell(x)| = o_p(n_\beta^{-1/2})$.

Property (3.7) implies that the bias of $\hat f_{\beta,-i}$ is of order $H^{d_1} = O(n_\beta^{-(1/2)-\eta})$ for some $\eta > 0$, whence it may be proved that (A.13) holds with $\ell = 4$. Result (A.11) and the property $n_\beta^{c-1} H^{-k} + H^{2d_1} = O(n_\beta^{-(1/2)-\eta})$ for some $c, \eta > 0$, which follows from (3.7), together imply (A.13) with $\ell = 5$. The lattice argument is used in the cases $\ell = 3, 4, 5$.

Next, we develop approximations to $S_2(x)$. Note that defining $a(v, x) = H^{-k} L\{(x - v)/H\}$, we have

$\hat f_{\beta,-i}(x) = \frac{1}{n_\beta - 1} \sum_{j:\ j \ne i} a(X_{\beta j}, x)$.
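Given $1 \le i, j \le n_\beta$ with $i \ne j$, define

$A(u, v, x) = \frac{g(u)\,\chi_\beta(u, x)\,\{a(v, u) - \kappa_\beta(u)\}}{f_\beta(u)^2\,h^{k-p}}\,K\!\left(\frac{u^{[k-p]} - x^{[k-p]}}{h}\right)$.

We shall construct a $U$-statistic-type projection of $A(X_{\beta i}, X_{\beta j}, x)$ using $D_1(v, x) = E\{A(X_{\beta i}, v, x)\}$, $D_2(u, x) = E\{A(u, X_{\beta j}, x)\}$ and $D_3(x) = E\{A(X_{\beta i}, X_{\beta j}, x)\}$. However, $D_2 = 0$ and therefore $D_3 = 0$, whence $S_2 = T_1 + T_2$, where

$T_1(x) = \frac{1}{n_\beta} \sum_{i=1}^{n_\beta} D_1(X_{\beta i}, x)$,

$T_2(x) = \frac{1}{n_\beta(n_\beta - 1)} \sum\sum_{i \ne j} \{A(X_{\beta i}, X_{\beta j}, x) - D_1(X_{\beta j}, x)\}$.

Now, $D_1(v, x) = D_3(v, x) - E\{D_3(X_{\beta j}, x)\}$, where

$D_3(v, x) = E\left[\frac{g(X_{\beta i})\,\chi_\beta(X_{\beta i}, x)\,a(v, X_{\beta i})}{f_\beta(X_{\beta i})^2\,h^{k-p}}\,K\!\left(\frac{X_{\beta i}^{[k-p]} - x^{[k-p]}}{h}\right)\right] = E\left[\frac{g(X_{\beta i})\,\chi_\beta(X_{\beta i}, x)}{f_\beta(X_{\beta i})^2\,h^{k-p} H^k}\,K\!\left(\frac{X_{\beta i}^{[k-p]} - x^{[k-p]}}{h}\right) L\!\left(\frac{X_{\beta i} - v}{H}\right)\right]$.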

Let $\xi(v, x) = g(v)\,\chi_\beta(v, x)\,f_\beta(v)^{-1}$. Then with the $O(n_\beta^{-\eta})$ remainders below being of that form uniformly in $v, x \in \mathcal{R}_k$, for some $\eta > 0$, we have



D 3 {v,x) = h k- PH k — — 



E 



h k ~P 



K 



X lk-P]_ xlk - P ] 



L 



Xgj -v 
H 



v [k-p] — x [k- 



+ Hh~ 1 iu^ p AL(w)dw. 



(A.14) 

Noting that by (3.7), $H h^{-1} = O(n_\beta^{-\eta})$ for some $\eta > 0$ and using the lattice argument, it can be proved from (A.14) that, uniformly in $x \in \mathcal{R}_k$,

(A.15) $T_1(x) = \frac{1}{n_\beta} \sum_{i=1}^{n_\beta} (1 - E)\,\frac{\xi(X_{\beta i}, x)}{h^{k-p}}\,K\!\left(\frac{X_{\beta i}^{[k-p]} - x^{[k-p]}}{h}\right) + o_p\{(n_\beta h^{k-p})^{-1/2}\}$,

where $E$ denotes the expectation operator. More simply, moment methods and the lattice argument can together be used to show that $T_2(x) = o_p\{(n_\beta h^{k-p})^{-1/2}\}$, uniformly in $x$. This result and (A.15) together imply that, uniformly in $x \in \mathcal{R}_k$,

(A.16) $S_2(x) = \frac{1}{n_\beta} \sum_{i=1}^{n_\beta} (1 - E)\,\frac{\xi(X_{\beta i}, x)}{h^{k-p}}\,K\!\left(\frac{X_{\beta i}^{[k-p]} - x^{[k-p]}}{h}\right) + o_p\{(n_\beta h^{k-p})^{-1/2}\}$.

Combining (A.12), (A.13) for $\ell = 3, 4, 5$ and (A.16), we find that, uniformly in $x \in \mathcal{R}_k$,

(A.17) $(\widehat{\psi_\beta g^\beta})(x) = S_1(x) - S_2(x) + o_p\{(n_\beta h^{k-p})^{-1/2}\} = E\{S_1(x)\} + \frac{1}{n_\beta} \sum_{i=1}^{n_\beta} \frac{\epsilon_{\beta i}\,\chi_\beta(X_{\beta i}, x)}{f_\beta(X_{\beta i})}\,K_{\beta i}(x) + o_p\{(n_\beta h^{k-p})^{-1/2}\}$.

[Note that $S_2(x)$ cancels, up to terms of order $o_p\{(n_\beta h^{k-p})^{-1/2}\}$, with $S_1(x) - E\{S_1(x)\}$, except for the part of the latter that involves the errors $\epsilon_{\beta i}$.] Result (3.10) follows directly from (A.17).